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Abstract 

Background: Variation in DNA copy number, due to gains and losses of chromosome segments, is common. A first 
step for analyzing DNA copy number data is to identify amplified or deleted regions in individuals. To locate such 
regions, we propose a circular binary segmentation procedure, which is based on a sequence of nested hypothesis 
tests, each using the Bayesian information criterion. 

Results: Our procedure is convenient for analyzing DNA copy number in two general situations: (1 ) when using data 
from multiple sources and (2) when using cohort analysis of multiple patients suffering from the same type of cancer. 
In the first case, data from multiple sources such as different platforms, labs, or preprocessing methods are used to 
study variation in copy number in the same individual. Combining these sources provides a higher resolution, which 
leads to a more detailed genome-wide survey of the individual. In this case, we provide a simple statistical framework 
to derive a consensus molecular signature. In the framework, the multiple sequences from various sources are 
integrated into a single sequence, and then the proposed segmentation procedure is applied to this sequence to 
detect aberrant regions. In the second case, cohort analysis of multiple patients is carried out to derive overall 
molecular signatures for the cohort. For this case, we provide another simple statistical framework in which data 
across multiple profiles is standardized before segmentation. The proposed segmentation procedure is then applied 
to the standardized profiles one at a time to detect aberrant regions. Any such regions that are common across two or 
more profiles are probably real and may play important roles in the cancer pathogenesis process. 

Conclusions: The main advantages of the proposed procedure are flexibility and simplicity. 

Keywords: Bayesian information criterion, Circular binary segmentation, Consensus molecular signature, Overall 
molecular signature, Variation in DNA copy number 



Background 

Copy number variations (CNVs) in DNA, due to gains 
and losses of chromosome segments, is common among 
healthy individuals and an important feature of tumor 
genomes. In healthy individuals, CNVs (most of which are 
inherited) are usually short and spaced far apart, whereas 
in tumor subjects, they can be quite long, sometimes 
spanning entire chromosomes. Because genomic instabil- 
ity can trigger the overexpression or activation of onco- 
genes and the silencing of tumor suppressors, mapping 
regions of common genomic aberrations have been used 
to discover cancer-related genes. Understanding genome 
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aberrations is important for a basic understanding of can- 
cer, as well as for diagnosis and clinical practice [1,2]. 
CNVs from cancer tissues, referred to as copy num- 
ber aberrations (CNAs), are acquired somatic aberrations 
most often observed only in cancer tissues. There is signif- 
icant interest in locating CNVs in normal individuals and 
CNAs in tumor subjects [3]. 

Various microarray technologies, including array com- 
parative genomic hybridization (aCGH), Affymetrix 
single-nucleotide polymorphism (SNP) genotyping 
arrays, Illumina Infinium arrays, and other SNP arrays, 
are used to investigate the roles of CNVs/CNAs. Here we 
describe aCGH in detail [4,5]. In this technique, DNA 
from a test sample and a normal reference sample are 
labeled differentially, using different fluorophores, and 
hybridized to several thousand spots on microarray chips. 
The spots are derived from most of the known genes and 
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non-coding regions of the genome, printed on a glass 
slide. The recorded value for each probe in a given sample 
is usually the log 2 ratio of the copy number measurement 
at the probe to its reference value, often computed from 
a set of normal population controls. The log 2 ratio of 
the normal state, in which the copy number of the target 
agrees with that of the control, should have a mean equal 
to zero. A contiguous stretch of measurements that are 
on average higher or lower than zero suggests a gain or 
loss in copy number. 

The analysis of DNA copy number data consists of iden- 
tifying amplified or deleted regions in each individual. 
There can be multiple CNVs/CNAs in a chromosome 
from a single sample. The binary segmentation proce- 
dure proposed by Vostrikova [6] has been widely used 
for locating multiple change-points. In each stage of this 
procedure, a single-change-point model is compared to 



a constant model with no change-points. Thus, the pro- 
cedure is easily implemented and circumvents the com- 
putational complexity normally faced in problems with 
a variable number of change-points. A potential prob- 
lem with the binary segmentation procedure is that it 
cannot detect a small segment buried in the middle of 
a large segment. Olshen et al. [7] modified the binary 
segmentation procedure to compare a model with a pair 
of change-points to a constant model with no change- 
points in each stage. This modified procedure is called 
circular binary segmentation, which is particularly use- 
ful for detecting short regions of a chromosome [7]. This 
approach recursively splits chromosomes into segments 
based on a statistic similar to the Student statistic, whose 
p-value is estimated by a time-consuming permutation 
process. To locate multiple CNVs/CNAs, we propose 
using circular binary segmentation based on a sequence 
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Figure 1 The breast cancer S1514. (a) The points are normalized log 2 ratios. The BACs are ordered by position in the genome, beginning at 1 p 
and ending at Xq. The inserts are chromosome numbers. The borders between chromosomes are represented by vertical bars, (b) Our analysis of 
S1 5 14. The normal state, where the copy number in the target agrees with that in the control, should have mean 0. A contiguous range of 
measurements whose average is higher or lower than 0 suggests a respective gain or loss in copy number. To identify gains and losses, we used 
r + = 0.3 and r~ = —0.3, respectively. We found single-copy duplication from the center to the end of chromosome 3. We identified double-copy 
duplication at the beginning of chromosome 5 and single-copy duplication in the remaining region of chromosome 5. We also identified very 
high-level amplification from the center to the end of chromosome 20. We found low-level losses on chromosome 13. The red lines indicate the 
mean values among the probes in segments detected by our method. 
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of nested hypothesis tests, each using the Bayesian infor- 
mation criterion (BIC) [8]. Note that our version is based 
on the existing circular binary segmentation strategy, but 
the proposed BIC is computationally simple, and is differ- 
ent from previous methods. Various authors [9-11] have 
suggested a BIC criterion for determining the number of 
change-points. 

In Methods Section, we describe the derivation of the 
proposed procedure and present a numerical example and 
simulation study. The proposed procedure can be flexibly 
adapted to analyze multiple DNA copy number data sets 
to discover both consensus and overall molecular signa- 
tures. In Results Section, these two general situations are 



separately discussed in "Integration of multiple platforms" 
and "Cohort analysis of multiple individuals" 

Methods 

Let Xi denote the log 2 ratio of the copy number mea- 
surement at the i-th probe of an individual. The vector 
X = (xi, . . . ,x m ) is then a DNA copy number data set for 
one chromosome of the individual, arranged according to 
genomic order along the chromosome. 

For a given threshold t + > 0, we construct a Bernoulli 
data set A = (a\, . . . , a m ) for gain events such that 

at = 1 if Xi > t + and a,- = 0 otherwise. (1) 
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Figure 2 Our analysis of the breast cancer S1 51 4. The borders between chromosomes are represented by vertical bars. The red lines indicate the 
mean values among the probes in segments obtained by the proposed procedure. As we increase the value of r+, higher-level gains are readily 
identifiable, (a) For r + = 0.2, we identified single-copy duplication at the end of chromosomes 3 and 1 1. We identified double-copy duplication at 
the beginning of chromosome 5 and single-copy duplication in the remaining region of chromosome 5. We also identified very high-level 
amplification from the center to the end of chromosome 20. (b) For t+ = 0.3, we found single-copy duplication from the center to the end of 
chromosome 3. We identified double-copy duplication at the beginning of chromosome 5 and single-copy duplication in the remaining region of 
chromosome 5. We also identified very high-level amplification from the center to the end of chromosome 20. We did not identify alternations on 
chromosome 1 1 , due to low-level amplified signals, (c) For r+ = 0.5, we identified double-copy duplication in the center of chromosome 20. At the 
end of chromosome 20, we identified very high-level gain (more than triple-copy). We identified double-copy duplication at the beginning of 
chromosome 5 and single-copy duplication in the remaining region of chromosome 5. We did not identify alternations on chromosomes 3 and 1 1, 
due to low-level amplified signals. 
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Figure 3 Our analysis of the breast cancer S1 514. The borders between chromosomes are represented by vertical bars. The red lines indicate the 
mean values among the probes in segments obtained by the proposed procedure, (a) For r~ = —0.3, we identified single-copy loss at 
chromosome 1 3. (b) For r~ = —0.4, we identified single-copy loss at chromosome 1 3. 



In a hypothetical situation for aCGH, Pollack et al. [12] 
specified log 2 0.8 < log 2 ratio < log 2 1.2 (-0.32 to 0.26) for 
the normal state, log 2 1.2 < log 2 ratio < log 2 2.0 (0.26 to 1) 
for low amplification, log 2 2.0 < log 2 ratio < log 2 3.0 (1 to 
1.58) for medium amplification, and log 2 ratio > log 2 3.0 
(=1.58) for high amplification. To locate low, medium, and 
high amplification, we would use r + =0.32, 1, and 1.58, 
respectively. If there are gain events in the target chromo- 
some of the individual, we expect to see many consecutive 
Is in A 



Table 1 Power for various t + and c 
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The numbers represent proportions of 1 000 predetermined elevated regions 
identified by our method. 



For a given threshold r < 0, we create D = 
(d\, . . . , d m ) such that 

di = 1 if Xi < t~ and di = 0 otherwise. (2) 

Pollack et al. [12] also specified log 2 ratio < log 2 0.8 (=- 
0.32) for loss. We would use t~ = —0.32. If there are loss 
events in the target chromosome, we expect to see many 
consecutive Is in D. 

The search for gain events is performed separately from 
that for loss events. To detect gain (loss) regions for an 
individual, we apply the following procedure to A (£>). 

Circular binary segmentation procedure 

We assume that the success rate for a Bernoulli data set A 
at probe location i changes according to 

K+l 

p(t) = ^ S(i e[ c*-i + 1, c k ] ) p k 
k=l 
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where 8(E) is the indicator function for event E and 
0 = Co < c\ < ■ ■ ■ < ck < CK+i = m are the 
unknown integer-valued change-points with associated 
success rates p\, . . . ,Pk+i- The goal of the change-point 
problem is to identify the number of change-points K, the 
change-points C\,..., ck, and the associated success rates 
pi,... ,Pk+1- 

We let Mo denote the constant model with no change- 
points (i.e. 0o = pi = • • • = Pm)- In Mq, the likelihood is 



Em 



*< ( i_0 o) Er =1 a- 



(3) 



Using the circular binary segmentation procedure, we 
reduce the complexity of the problem by assuming that the 
segment forms a circle. We test the hypothesis that the arc 
from c\ + 1 to ci and its complement have different success 
rates. Let Mi denote the change-point model given by a 
pair of c\ and ci. This implies that 9\ = p\ = ■ ■ ■ =p Cl = 



Pc 2 +i = ■ ■ ■ = Pm # Pa+1 = ■ ■ ■ = Pc 2 = °2, where 
1 < c\ < C2 < w. In Mi, the likelihood is 



Li(fii,c 2 ,9i,e 2 \A) = e 1 



c 2 + l ' 



E,ii(i-«/)+E£ c , +1 (i-«f) 



x (1 - 0i) 
xep ci+lfli (l-fl 2 )^i« (1 - fli) . 

(4) 

Let us consider the constant model Mq. The likelihood 
function (3) is maximized by 6q = YlJLi &j/m> giving 
Lq(6q\A). For M\, the likelihood (4) is maximized along 
1 < c\ < C2 < m via 

i=i a i + Zw= C2 +i <H 



(dl(ci,C2),0 2 (ci,C2)j 



m — C2 + C\ 
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Figure 4 Consensus estimate. The points are normalized log 2 ratios in the 33-42-mb section on chromosome 3 of theTCGA-02-0104 sample from 
(a) the Memorial Sloan-Kettering Cancer Center (MSKCC) and (b) Harvard Medical School. The red lines indicate the mean values among the probes 
in segments obtained by the proposed circular binary segmentation procedure. Panel (c) shows a multi-platform consensus based on the 
combined data sets of Memorial Sloan-Kettering Cancer Center and Harvard Medical School. We found two more small segments, located in the 
38.4-mb and 40.2-mb regions, which are indicated by the arrows. These were not identified in (a) and (b). 



Yang BMC Bioinformatics 201 2, 1 3:277 
http://www.biomedcentral.eom/1 471 -21 05/1 3/277 



Page 6 of 13 



The fully maximized likelihood in the segmentation 
model Li(ci,C2,Oi(ci,C2):02(ci,C2)\A) is then obtained 
by maximizing Li(ci, C2, 9\{c\, c 2 ), &2(ci, C2)\A) over the 
finite set 1 < c\ < C2 < m. 

We choose between Mo and Mi in accordance with the 
BIC. We define 

BICio = logLi(ci,C2,$i(ci,C2),fMci,c 2 )|yl) 

- log L 0 (§ 0 \A) - -(qi- ^o)logw (5) 

where the last term in (5) is a penalty function that adjusts 
for the difference in dimensionality between the two mod- 
els. In this application, q\ = 4 and qo = 1. If BICio is 
negative, the decision is to accept Mo. If BICio is posi- 
tive, we reject the constant model and estimate the first 
segment given by the pair of c\ and C2- 

To test Mq versus M\, the procedure begins by set- 
ting c\ = 1 and C2 = m. Let BIC^ be the observed 
BICio, and [c\ + 1,C2] be the corresponding interval. If 
BIC?q S < 0, we choose Mq, estimate the constant success 



rate to be p(i) = p\ for i e[l,m] with p\ = YlfLi a j/ m > 
and stop. If BIC^ > 0, [l,cj], [c x + l,c 2 ], and [c 2 + l,m] 
are recursively scanned using the same procedure. The 
recursion stops when none of the subregions contains its 
corresponding BIC°g S > 0. 

Application to aCGH data 

Snijders et al. [5] used aCGH to detect low-level DNA 
copy number gains and losses, as well as high-level ampli- 
fications for breast cancer specimen S1514. Their array 
contained 2276 probes for the mapped bacterial artificial 
chromosomes (BACs), which are large segments of DNA, 
typically 100 to 200 kilo-bases. Figure 1(a) shows a plot 
of the normalized log 2 ratios of S1514. Low-level gains 
and losses, as well as high-level amplifications were found 
inS1514. 

In Figure 1(b), we respectively use r + = 0.3 in Equation 
(1) and r~ = —0.3 in Equation (2) to identify gains 
and losses. Our procedure was executed to detect aber- 
rated regions for each of the 23 chromosomes. The red 
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Figure 5 Comparison of the proposed method, circular binary segmentation, CSGseq, and GLAD. The points are based on the combined 
log 2 ratios from Memorial Sloan-Kettering Cancer Center and Harvard Medical School. The top panel shows our segments. The last three panels 
show segments from circular binary segmentation, CSGseq, and GLAD. The red lines indicate the mean values among the probes in segments. 



Yang BMC Bioinformatics 201 2, 1 3:277 
http://www.biomedcentral.com/1471-2105/13/277 



Page 7 of 13 



lines indicate the mean values among clones in segments 
obtained by our procedure. We found gains on chromo- 
somes 3 and 5, loss on chromosome 13, and high-level 
amplification on chromosome 20. 

As we increase r + , higher-level gains are readily 
identifiable, as shown in Figure 2. As we decrease 
t - , lower-level losses are readily identifiable, as shown 
in Figure 3. From Figure 2 and Figure 3, ampli- 
fied and deleted regions of an individual are clearly 
separated, because these regions would trigger the 
activation of oncogenes and the silencing of tumor 
suppressors, respectively. 

Simulation study 

We evaluated the performance of our algorithm. The data 
to be segmented were generated from the model Xi ~ 
N(fii, 1), 1 < i < m, where m is the number of probes and 
i± denotes the mean. Let //, = c when I < i < I + k, and 
/Xj = 0 otherwise. The mean parameter c was set equal to 
1, 2, or 3. The value c = 1 represents low-level amplifica- 
tion. The values c = 2 and c = 3 represent moderate and 
high-level amplification, respectively. We simulated 1000 
data sets from 500 probes using this simulation setup. 

We randomly selected k from (3,. . . ,30), and / from 
(1,2, ... ,m — k). The values of / and k control the loca- 
tion of the change and the width of the changed segment, 
respectively. Note that the width of the changed seg- 
ment is at least 3 probes. Each data set had one elevated 
region ranging from 3-30 probes, and the elevation varied 
according to c. 



The power is the proportion of data sets in which the 
estimated change-points equal the true change-points. 
Table 1 lists the power for various c and r + . The power 
was lower for c = 1 because c = 1 represents low-level 
amplification. However, it increased as c increased. 

When r + < 2.0, we identified low- and higher-level 
amplification, and thus the power was high. In contrast, 
when r + > 2.5, we only observed higher-level ampli- 
fication as r + increased, and consequently the power 
was lower. 

Results 

Integration of multiple platforms 

Several sources (platforms, analytical methods, and labs) 
were used to study the variation in copy number of the 
same individual. Their profiles may have different mean 
levels of copy number aberrations and different noise lev- 
els [13,14]. They may also have different numbers of loci 
and variable coverage in different parts of the genome. If 
data sets from several sources are analyzed individually, 
it is difficult to reach a consensus when they disagree on 
the identity of a CNV/CNA. Combining data sets may 
increase resolution, facilitating the discovery of genes and 
probes that are important in the individual. To derive a 
consensus molecular profile, we combine multiple sources 
into a single sequence, and then apply our procedure to 
this sequence. 

The observed data constitute a two-dimensional array 
Xij for i = 1, . . . , mj and /' = 1, . . . , n, where Xy is the 
data point at the i-th probe and the /-th source, and « 




Chromosome 

Figure 6 The fibroblast cell line GM1 3300 has known alterations only on chromosomes 1 and 4. The points are normalized log 2 ratios. The 
borders between chromosomes are represented by vertical bars. 
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is the total number of sources. For the j-th source, rrij 
probes are ordered by chromosome location (ty, . . ., t m .j), 
which may have variable loci and coverage in parts of the 
chromosome. 

For a given threshold T. > 0, an indicator variable 
fly is defined to classify the DNA copy number level as 
increased or not; i.e., 

fly = 1 if xij > Xj and fly = 0 otherwise. (6) 

We then construct a Bernoulli data set Aj = 
(ay, . . . , a mj j) for each source /. Because different 
sources exhibit different degrees of attenuation of the 
true DNA copy number, we use a threshold for 
each source, rather than applying a common thresh- 
old to all sources. Note that we do not require pre- 
standardization of different sources. We keep these 
sequences ordered according to chromosome position, 
and integrate (tn> ■ ■ ■> t mi i)> ■ ■ ■> (tin, ■ ■ ■> t m „n) into a sin- 
gle sequence, which is the union of the chromosomic 
locations of probes from all profiles. Then A\, ...,A n 



are integrated into A along the single sequence. A pro- 
vides a consensus molecular profile and higher resolution 
for detecting CNAs. If there are amplification events in 
the target chromosome, we expect to see many con- 
secutive Is in A. To identify amplification regions, we 
apply the proposed procedure to A, as discussed in 
Methods Section. 

The search for loss events is performed separately from 
that for gain events. For a given threshold xj~ < 0 and 
for each source /, du is defined to classify the DNA copy 
number level as decreased or not: 

dij = 1 if < xj~ and dg = 0 otherwise. (7) 

We then construct a Bernoulli data set Dj = 
(dy, . . . , d m ,j) for j = 1, . . . , n, and D\, . . . ,D n are inte- 
grated into D along the integrated single sequence. To 
identify deletion regions for the individual, we apply the 
proposed procedure to D. 
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Figure 7 Our analysis of GM1 3300. The fibroblast cell line GM13300 has known alterations only on chromosomes 1 and 4. The points are 
normalized log 2 ratios, and the lines indicate the mean values among the points in segments obtained by our method, (a) CNVs of GM1 3300 on 
BAC clones from chromosome 1. (b) CNVs of GM13300 on BAC clones from chromosome 4. 
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Application to The Cancer Genome Atlas data 

The Cancer Genome Atlas (TCGA) project (http://tcga- 
data.nci.nih.gov/tcga) is a collaborative initiative for a bet- 
ter understanding of cancer, using existing large-scale 
complete-genome technologies [15]. One of the tumor 
types studied is glioblastoma multiforma (GBM), which 
is a brain tumor. The TCGA-02-0104 (vials 01A) sam- 
ple is known to have a large number of copy number 
aberrations on chromosome 3 at different mean levels 
[13]. To provide an application to somatic CNAs, we 
analyze TCGA-02-0104 samples from two TCGA cen- 
ters: the Memorial Sloan-Kettering Cancer Center and 
Harvard Medical School. Both centers adopted Agilent 
CGH 244 K arrays, which have 236000 loci, 12.7 kb 
average between loci, and 60-mer probes. The different 
TCGA centers have identified aberrant regions indepen- 
dently of one another. It has been suggested that more 
accurate, precise, and higher-resolution results could be 
obtained if copy number estimates from the different sites 
were combined. 



The proposed procedures were separately applied to 
detect amplification or deletion in the 33-42-mb (start- 
end) region on chromosome 3. Figure 4(a) and 4(b) show 
the individual results of the Memorial Sloan-Kettering 
Cancer Center and Harvard Medical School, respectively. 
Here, we used = = 0.5 in Equation (6) and r-f = 
Tj" = —0.5 in Equation (7) because the two centers used 
the same Agilent platform. Figure 4(c) shows a consen- 
sus estimate along the integrated sequence. We found two 
short fluctuations, located in the 38.4-mb region and the 
40.2-mb region, as indicated by the arrows in the figure. 
Note that these two segments were not identified by the 
single-source analyses presented in Figure 4(a) and 4(b). 

In Figure 5, our results are compared with popular 
CNV segmentation algorithms including circular binary 
segmentation [7], CGH-seg [16], and GLAD [17]. Their 
segment results are obtained by a web-based tool, CGH- 
web [18]. All methods show that gain and loss regions 
are respectively 35-38 mb (3p22.2-3p22.3) and 38-40 mb 
(3p22.1-3p22.2). However, our method and circular binary 
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Figure 8 Our analysis of GM03563. The fibroblast cell line GM03563 has known alterations only on chromosomes 3 and 9. The points are 
normalized log 2 ratios, and the lines indicate the mean values among the points in segments obtained by our method, (a) CNVs of GM03563 on 
BAC clones from chromosome 3. (b) CNVs of GM03563 on BAC clones from chromosome 9. The first two clones with log 2 ratioRi -1 indicate a 
single-copy deletion. 
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segmentation [7] are sensitive to the detection of short 
segments in this example. 

Circular binary segmentation [7] based on permutation 
took 95 seconds to detect the segmentation results of a 
total of 1358 probes, as shown in Figure 5. In contrast, 
the proposed procedure based on BIC took less than 15 
seconds, where the computation was done on a 2.66 GHz 
Intel i5 core processor. 

Cohort analysis of multiple individuals 

We turn next to the cohort problem of discovering over- 
all molecular signatures. Each profile is obtained from a 
different individual with the same type of cancer, and is 
assayed on the same platform type. The observed data are 
a two-dimensional array xy for i = 1, . . . , m,j = 1, . . . , «, 
where xn is the data point at the /-th probe according to its 
genomic order along the chromosome, and the /-th indi- 
vidual profile. Note that m is the number of probes and n is 
the number of individuals. To derive overall molecular sig- 
natures, we provide a simple statistical framework, which 
standardizes data across multiple profiles before segmen- 
tation. Then, we analyze the standardized profiles one at a 
time to detect aberrant regions. 
We standardize xij. For each probe i, we let zy = 

^-^Si— J/y — 5=* for ] = l >---> n - 

Hence the z ; y have a common mean equal to 0 and a 
common variance equal to 1. An indicator variable ay is 
defined to classify the DNA copy number level for the /-th 



probe and /-th individual as increased or not; i.e., 

ay = 1 if Zij > Y + an d a ij = 0 otherwise. (8) 

For the following numerical example, we used y + = 3. 
A segment with probes deviating by three standard devi- 
ations from the mean value of all samples is likely to 
indicate true gain. For large y + , higher-level gains are 
readily identifiable. If there are gain events in the tar- 
get chromosome of the /-th individual (j = l,...,n), we 
expect to see many consecutive Is in Ai = (ay, . . . , a m j). 
To identify the amplification regions for the /-th individ- 
ual, we apply the proposed procedure to Aj, as discussed in 
Methods Section. When common amplified regions occur 
for more than one individual, the aberrations are probably 
real and important for cancer pathogenesis processes. 

The search for loss events is performed separately from 
that for gain events, dy is defined to classify the DNA copy 
number level for the /-th probe and the /-th individual as 
decreased or not; i.e., 

dy = 1 if Zy < y~ and dy = 0 otherwise. (9) 

For our numerical example, we used y~ = —3. If there 
are deletion events in the target chromosome of the /-th 
individual, we expect to see many consecutive Is in Dj = 
(dy, . . . , d m j). To identify the deletion regions of the /-th 
individual, we apply the proposed procedure to Dj. 



Table 2 Summarized results of applying the proposed framework to nine cell lines 



Cell line 


Chromosome (exact location) 


Aneuploidy type 


Our method 


GM03563 


3 (3q12-3qter) 


Trisomy 


0 




9 (9pter-9p24) 


Monosomy 


0 


GM05296 


10 (10q21-10q24) 


Trisomy 


0 




11 (1 1 pi 2-1 1 pi 3) 


Monosomy 


0 


GM01750 


9 (9pter-9p24) 


Trisomy 


0 




14(14pter-14q21) 


Trisomy 


0 


GM03134 


8(8q13-8q22) 


Monosomy 


0 


GM13330 


1 (1 q25-1 qter) 


Trisomy 


0 




4 (4q35-4qter) 


Monosomy 


0 


GM01535 


5 (5q33-5qter) 


Trismoy 


0 




12 (12q24-12qter) 


Monosomy 


X 


GM07081 


7(7pter-7q11.2) 


Trisomy 


0 




15 (15pter-15q1 1.2) 


Monosomy 


X 


GM13031 


17 (17q21.3-17q23) 


Monosomy 


0 


GM01524 


6(6q15-6q25) 


Trisomy 


0 



Of the 15 altered regions found by spectral karyotyping, 13 were identified by our method. The symbol O indicates that our method correctly identified the known 
alterations found by spectral karyotyping, whereas the symbol x indicates that it did not identify the region. Two unidentified regions appear in chromosome 1 2 on 
GM01535 and chromosome 15 on GM07081. For GM01535, the region is represented by only one probe, and single altered probes were not found. For GM07081, our 
result is consistent with that of Snijders et al. [5], in that no evidence of an alteration was seen in the GM07081 data. 
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Standardization across multiple samples provides a 
multi-sample summary for the overall molecular signa- 
tures. However, one drawback to this type of standard- 
ization is that it restricts inferences about increased and 
decreased DNA copy numbers relative to the mean of 
the samples under study. When most or all samples are 
either two-fold over-expressed or under-expressed rel- 
ative to normal tissue (i.e., a majority of the samples 
have identical increases or decreases), it is impossible to 
properly identify these aberrations using the proposed 
standardization. These situations are very rare, and most 
aberrant intervals appear only in some significant subset 
of the samples. When pooling data across multiple indi- 
viduals, not all samples are expected to carry the same 
aberrant regions. 

Application to fibroblast cell lines 

We applied our framework to the aCGH data presented 
by Snijders et al. [5]. The data were obtained from 



single-array experiments on 15 fibroblast cell lines. 
The data are available in Tables E to H at http://www. 
nature.com/ng/journal/v29/n3/suppinfo/ng754_Sl.html. 
Each array contains 2276 mapped BACs spotted in 
triplicate. Because spectral karyotyping has shown 
that aberrations occur within a particular chromo- 
some for each of GM01524, GM01535, GM01750, 
GM03134, GM03563, GM05296, GM07081, GM13031, 
and GM13330, we limited our analysis to these nine cell 
lines. The data from a typical cell line experiment, specif- 
ically from cell line GM13300, can be seen in Figure 6. 
The proposed procedure was employed to detect aber- 
rated regions for each of the 23 chromosomes. We used 
y + = 3 in Equation (8) and y~ = —3 in Equation (9), 
respectively. GM13300, shown in Figure 6, has known 
aberrations only on chromosomes 1 and 4. The results 
shown in Figure 7 are consistent with those of Snijders 
et al. [5], in that our framework correctly identified aber- 
rations only on chromosomes 1 and 4. Our procedure 



(a) 
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Figure 9 Genome-wide analysis of the breast cancer SI 514. (a) The points are normalized log 2 ratios. The BACs are ordered by position in the 
genome, beginning at 1 p and ending at Xq. The inserts are chromosome numbers. The borders between chromosomes are represented by vertical 
bars, (b) Genome-wide search of S1 5 14. To identify gains and losses, we used r + = 0.3 and r~ = —0.3, respectively. We found single-copy gain 
from the center of chromosome 20 to the end of 23 (chromosome X), and single-copy loss from the beginning of chromosome 13 to the end of 
chromosome 14. The red lines indicate the mean values among the probes in segments detected by our method. 
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also correctly identified aberrations on chromosomes 3 
and 9 of GM03563 (Figure 8). 

Of the 15 aberrated regions listed in Table 2, which were 
found by spectral karyotyping, 13 were identified by our 
framework. The two unidentified regions were on chro- 
mosome 12 (GM01535) and chromosome 15 (GM07081). 
The aberrated region on GM01535 is represented by only 
a single probe, and single aberrated probes cannot be 
found. For GM07081, our result is consistent with that 
of Snijders et al. [5], in that no evidence of an aberra- 
tion appears in the aCGH data. Therefore, our procedure 
found everything it should have found. For a particular 
cell line and chromosome, we define a false positive to 
be an aberration that is identified by our framework but 
is not detected by spectral karyotyping. Our procedure 
produced only one false positive, at chromosome 4 on 
GM01524, although we do not know that this is a real 
aberration that is undetectable by spectral karyotyping. 
Hence, our procedure was able to identify the aberra- 
tions with only a single false positive, whereas the circular 
binary segmentation method of Olshen et al. [7] produced 
at least nine false positives. Furthermore, the aberrations 
identified by our procedure perfectly matched the CNVs 
found via spectral karyotyping. 

Discussion 

Our procedure is versatile in the sense that only higher- or 
lower-level gains/losses are readily identifiable. In partic- 
ular, there are two interesting types of aberrated regions. 
The first of these is a spike, which is often a small region 
with extremely large or small log 2 ratios. Only spikes are 
readily identifiable when large positive values of r + and 
large negative values of r~ are used in Equations (1) and 
(2), respectively. The second type is a consistent gain or 
loss region, whose log 2 ratios may not deviate very much 
from 0, but tend to remain positive or negative over the 
greater region. Only lower-level gains are readily iden- 
tifiable when we define a new Bernoulli data set A = 
(«i, . . . , a m ) for a small positive value of r + and a positive 
value of 6, such that 

fl; = 1 if r + < Xi < r + + e and = 0 otherwise. 

Similarly, only lower-level losses are readily identifiable 
for a small negative value of r~ when we define a new 
Bernoulli data set D = (d\, . . . , d m ) such that 

di = 1 if t~ — e < Xi < r~ and d[ = 0 otherwise. 

We pointed out that our procedure lacks the ability to 
detect CNAs when a whole chromosome is duplicated or 
deleted. For example, in Figure 1, the elevated X chromo- 
some ratios of SI 514 reflect the male-female difference 
in the X chromosome copy numbers shown. These eleva- 
tions are known to be constant for single-copy gains on a 



complete X chromosome. Because there were no fluctua- 
tions on the elevated, complete X chromosome, our pro- 
cedure could not detect the aberrations when based on a 
chromosome-wide search. To detect aberrations spanning 
complete chromosomes, our procedure should be based 
on a genome-wide search, which uses all 23 chromosomes 
together. Figure 9 shows the genome-wide search, which 
properly identified single-copy duplication in the entire X 
chromosome. 

Conclusions 

To locate the aberrated regions in an individual, we pro- 
pose a circular binary segmentation procedure based 
on BIC, which is nonparametric in the sense that it 
does not rely on any assumptions regarding indepen- 
dence or underlying distributions. The procedure does 
not require data to be transformed with missing val- 
ues imputed or with extreme outliers truncated. At each 
stage of the procedure, we need only to compare a 
model with a pair of change-points to a constant model 
with no change-points. Thus the procedure is easy to 
implement, and circumvents the computational complex- 
ity we would normally face in problems with a vari- 
able number of change-points. The procedure can be 
flexibly adapted to analyze multiple DNA copy number 
data sets, to discover consensus molecular signatures or 
overall molecular signatures. Moreover, we provide two 
simple statistical frameworks appropriate for detecting 
these signatures. 

Competing interests 

The author declares that he have no competing interests. 
Acknowledgements 

This work was supported by the National Research Foundation of Korea(NRF) 
grant funded by the Korea government(MEST) (201 2-0005352). The author 
thanks the two reviewers for their constructive comments and suggestions. 

Received: 30 April 201 2 Accepted: 22 October 201 2 
Published: 30 October 201 2 

References 

1 . Lipson D, Aumann Y, Ben-Dor A, Linial N, Yakhini Z: Efficient calculation 
of interval scores for DNA copy number data analysis. J Comput Biol 

2006, 13:215-228. 

2. Sun W, Wright FA, Tang Z, Nordgard SH, Loo PV, Yu T, Kristensen VN, Perou 
CM: Integrated study of copy number states and genotype calls 
using high-density SNP arrays. Nucleic Acids Res 2009, 37:5365-5377. 

3. Shen J, Zhang N: Change-point model on nonhomogeneous Poisson 
processes with application in copy number profiling by 
next-generation DNA sequencing. Ann ApplStat 201 2, 6:476-496. 

4. Pollack JR, Perou CM, Alizadeh AA, Eisen MB, Pergamenschikov A, Williams 
CF, Jeffrey SS, Botstein D, Brown PO: Genome-Wide Analysis of DNA 
Copy-Number Changes Using cDNA Microarrays. Nat Genet 1999, 
23:41 -46. 

5. Snijders AM, Nowak N, Segraves R, Blackwood S, Brown N, Conroy J, 
Hamilton G, Hindle AK, Huey B, Kimura K, Law S, Myambo K, Palmer J, 
Ylstra B, Yue JP, Gray JW, Jain AN, Pinkel D, Albertson DG: Assembly of 
Microarrays for Genome-Wide Measurement of DNA Copy Number. 
Nat Genet 200 1 , 29:263-264. 

6. Vostrikova L: Detecting disorder in multidimensional random 
process. Soviet Math Dokl 1 981 , 24:55-59. 



Yang BMC Bioinformatics 201 2, 1 3:277 
http://www.biomedcentral.com/1471-2105/13/277 



Page 13 of 13 



7. Olshen AB, Venkatraman ES, Lucito R, Wigler M: Circular binary 
segmentation for the analysis of array-based dna copy number 
data. Biostatistics 2004, 5:557-572. 

8. SchwarzG: Estimating the dimension of a model. Ann Statist 1 978, 

6:461-464. 

9. Zhang NR, Siegmund D: A modified bayes information criterion with 
applications to the analysis of comparative genomic hybridization 
data. Biometrics 2007, 63:22-32. 

1 0 Yang TY, Kuo L: Bayesian binary segmentation procedure for a 
Poisson process with multiple changepoints. J Comput Graphical 
Statist 2001, 10:772-785. 

1 1 Yang TY: Bayesian binary segmentation procedure for detecting 
streakiness in sports. J RStatSocSer A 2004, 167:627-637. 

12. Pollack JR, SrlieT, Perou CM, Rees CA, Jeffrey SS, Lonning PE, Tibshirani R, 
Botstein D, Brresen-Dale AL, Brown PO: Microarray analysis reveals a 
major direct role of DNA copy number alteration in the 
transcriptional program of human breast tumors. Proc Natl Acad Sci 
USA 2002, 99:12963-12968. 

1 3. Bengtsson H, Ray A, Spellman P, Speed T: A single-sample method for 
normalizing and combining full-resolution copy numbers from 
multiple platforms, labs and analysis methods. Bioinformatics 2009, 
25:861-867. 

14. Zhang NR, Senbabaoglu Y, Li JZ: Joint estimation of DNA copy number 
from multiple platforms. Bioinformatics 201 0, 26:1 53-1 60. 

1 5. The TCGA Research Network: Comprehensive genomic 
characterization defines human glioblastoma genes and core 
pathways. Nature 2008, 455:1 061-1 068. 

1 6. Picard F, Robin S, Lavielle M, Vaisse C, Daudin JJ: A statistical approach 
for array CGH data analysis. BMC Bioinformatics 2005, 6:27. 

17. Hupe P, Stransky N.Thiery JP, Radvanyi F, Barillot E: Analysis of array 
CGH data: from signal ratio to gain and loss of DNA regions. 
Bioinformatics 2004, 20(1 8):341 3-3422. 

18 Lai W, Choudhary V, Park PJ: CGHweb: a tool for comparing DNA copy 
number segmentations from multiple algorithms. Bioinformatics 
2008, 24(7):1014-1015. 

r \ 

doi:10.1 186/1471-2105-13-277 

Cite this article as: Yang: Simple binary segmentation frameworks for 
identifying variation in DNA copy number. BMC Bioinformatics 2012 13:277. 
v ) 



Submit your next manuscript to BioMed Central 
and take full advantage of: 

• Convenient online submission 

• Thorough peer review 

• No space constraints or color figure charges 

• Immediate publication on acceptance 

• Inclusion in PubMed, CAS, Scopus and Google Scholar 

• Research which is freely available for redistribution 



Submit your manuscript at \ n - r t i 

www.biomedcentral.com/submit \J B' 01 ™*" central 



